ABSTRACT

Multi-dimensional classification (MDC) is the supervised 
learning problem where an instance may be associated with 
multiple classes, rather than with a single class as in tradi- 
tional binary or multi-class single-dimensional classification 
(SDC) problems. MDC is closely related to multi-task learn- 
ing, and multi-target learning (generally, in the literature, 
multi-target refers to the regression case). Modeling depen- 
dencies between labels allows MDC methods to improve 
their performance at the expense of an increased computa- 
tional cost. In this paper we focus on the classifier chains 
(CC) approach for modeling dependencies. On the one hand, 
the original CC algorithm makes a greedy approximation, and 
is fast but tends to propagate errors down the chain. On the 
other hand, a recent Bayes-optimal method improves the per- 
formance, but is computationally intractable in practice. Here 
we present novel Monte Carlo schemes, both for finding a 
good chain sequence and performing efficient inference. Our 
algorithms remain tractable for high-dimensional data sets
and obtain the best overall accuracy, as shown on several
real data sets. 

Index Terms — multi-label; classification; Monte Carlo; 
chaining 

1. INTRODUCTION 

Multi-dimensional classification (MDC) is the supervised 
learning problem where an instance may be associated with 
multiple classes, rather than with a single class as in tra- 
ditional binary or multi-class single-dimensional classifica- 
tion (SDC) problems. MDC is closely related to multi-task 
learning, and multi-target learning (generally, in the litera- 
ture, multi-target refers to the regression case). The recently 
popularised task of multi-label classification (MLC; see [1, 2] for overviews) can be viewed as a particular case of the multi-dimensional problem that only involves binary classes, considered as labels that can be turned on (1) or off (0) for any data instance.

The MDC learning context is receiving increased attention in the literature, since it arises naturally in a wide variety of domains such as text, audio, still images and video, bioinformatics, and medical diagnosis [1, 2, 3]. The main challenge in this area is modeling label dependencies without incurring intractable complexity.

A basic approach to MDC is the independent classifiers (IC) method, which decomposes the MDC problem into a set of SDC problems (one per label) and uses a separate classifier for each label variable^1. In this way, MDC is turned into a series of standard SDC problems that can be solved with any off-the-shelf binary classifier (e.g., a logistic regressor or a support vector machine^2). Unfortunately, although IC has a low computational cost, it cannot provide high performance, because it does not model dependencies between labels [1, 5].

In order to model dependencies explicitly, several alternative schemes have been proposed, such as the so-called label powerset (LP) method [9]. LP considers each potential combination of labels in the MLC problem as a single label. In this way, the multi-label problem is turned into a traditional multi-class problem that can be solved using standard methods. Unfortunately, given the huge number of class values produced by this transformation, this method is usually infeasible in practice, and suffers from issues like overfitting. This was recognised by [10], which provides approximations to the LP scheme that reduce these problems, although such methods have been superseded in recent years.

A more recent idea is using classifier chains (CC), which improves the performance of IC and LP by constructing a sequence of classifiers that make use of previous outputs of the chain. The original CC method, introduced in [6] and extended in [7], makes a greedy approximation, and is fast but tends to propagate errors down the chain. Nevertheless, a very recent extensive experimental comparison reaffirmed that CC is among the highest-performing methods for MLC, and recommended it as a benchmark algorithm [11]. A CC-based Bayes-optimal method, probabilistic classifier chains (PCC), has also been recently proposed [7]. However, although it improves the performance of CC, its computational cost is too large for most real-world applications.

1 We henceforth use the term label throughout to refer generally to a class variable, and not necessarily a binary-only tag, as is the case in much of the multi-label literature.

2 Support vector machines are naturally binary, but can easily be adapted to multi-class problems by using a pairwise voting scheme, as in [4].

In this paper we introduce novel methods that attain the performance of PCC, but remain tractable for high-dimensional data sets. Our approaches are based on a double Monte Carlo optimization technique and, unlike all other chain-based methods in the literature, they explicitly search the space of possible chain sequences during the training stage. Hence, predictive performance can be traded off for scalability depending on the application.

2. MULTI-DIMENSIONAL CLASSIFICATION (MDC) 

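Let us assume that we have a set of training data composed of $N$ labelled examples, $\mathcal{D} = \{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})\}_{i=1}^{N}$, where $\mathbf{x}^{(i)} = [x_1^{(i)}, \ldots, x_D^{(i)}]^\top$ is the $i$-th $D$-dimensional instance (input), with $x_d^{(i)} \in \mathcal{X}_d$ for $1 \le d \le D$, and $\mathbf{y}^{(i)} = [y_1^{(i)}, \ldots, y_L^{(i)}]^\top$ is the $i$-th example's $L \times 1$ label relevance vector (output), with

$$y_j^{(i)} \in \{1, \ldots, K_j\} = \mathcal{Y}_j,$$

where $K_j$ is the (finite) number of classes of the $j$-th label and $y_j^{(i)}$ is the $i$-th example's $j$-th class assignment. In MDC we seek to learn a function, $\mathbf{y} = \mathbf{h}(\mathbf{x})$, that assigns a vector of labels,

$$\mathbf{y} \in \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_L,$$

to each instance,

$$\mathbf{x} \in \mathcal{X}_1 \times \cdots \times \mathcal{X}_D \subseteq \mathbb{R}^D.$$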

Let us assume that the true posterior distribution of the data is $f(\mathbf{y}|\mathbf{x})$. From a Bayesian point of view, the optimal label assignment for a given test instance, $\mathbf{x}^*$, is provided by the maximum a posteriori (MAP) label estimate:

$$\hat{\mathbf{y}}_{\text{MAP}} = \mathbf{h}_{\text{MAP}}(\mathbf{x}^*) = \arg\max_{\mathbf{y}} f(\mathbf{y}|\mathbf{x}^*). \tag{1}$$

Unfortunately, the true distribution, $f(\mathbf{y}|\mathbf{x})$, is usually unknown, and the classifier has to work with an approximation, $p(\mathbf{y}|\mathbf{x})$, constructed from the training data. Hence, the (possibly sub-optimal) label prediction is finally given by

$$\hat{\mathbf{y}}_{\text{MAP}} \approx \hat{\mathbf{y}}^* = \mathbf{h}(\mathbf{x}^*) = \arg\max_{\mathbf{y}} p(\mathbf{y}|\mathbf{x}^*). \tag{2}$$

Although binary-only multi-label problems can be considered as a subset of multi-dimensional problems, the reverse is not true, and there are some crucial differences, meaning that much research in MLC is not directly applicable to MDC. In the first instance, there is a higher dimensionality for the same $L$: MLC deals with $2^L$ possible values, whereas MDC deals with $\prod_{j=1}^{L} K_j$. Moreover, in MDC there are not only more possible classifications, but many more likely ones, due to a qualitative difference between tagging data with binary labels and assigning classes, even if these classes are binary. In typical MLC problems, the binary classes are used to indicate relevance (e.g., 1) and irrelevance (e.g., 0). For example, the label "beach" may be relevant to a particular picture. In practice, on average typically only slightly more than $1/L$ of the labels are relevant to each example (see also Table 2). This also means that the chance of $\ell$ labels being relevant to a single data instance falls to zero as $\ell \to L$ (it is very reasonable to expect that no data instance will be assigned all, or even a majority, of the labels). This means that we have more prior information. In MDC, however, classes (including binary classes) are used differently, for example indicating "male"/"female". Clearly (prior knowledge of the problem aside), we expect the chance of a particular data instance being classified "male" to be around 0.5. In summary, in MDC the practical $\mathbf{y}$-space is much greater than in MLC, making probabilistic inference more challenging.
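To make the difference in dimensionality concrete, the following minimal sketch computes the size of the label space in both settings. The function names and the example values ($L = 5$ labels, with either two or three classes per label) are illustrative assumptions, not part of the method.

```python
from math import prod


def mlc_space_size(num_labels):
    """Number of label vectors when every label is binary (MLC): 2^L."""
    return 2 ** num_labels


def mdc_space_size(class_counts):
    """Number of label vectors when label j has K_j classes (MDC): prod_j K_j."""
    return prod(class_counts)


# Hypothetical example: L = 5 labels.
print(mlc_space_size(5))        # 32 binary label vectors
print(mdc_space_size([3] * 5))  # 243 label vectors with K_j = 3 for all j
```

Even with the same number of labels, moving from binary to three-class labels multiplies the number of candidate label vectors several times over, before taking into account that many more of them are plausible.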

2.1. Independent Classifiers (IC) 

Using independent classifiers (IC) is commonly mentioned in the MLC and MDC literature [3, 2, 6]. This approach transforms the multi-dimensional problem into $L$ separate uni-dimensional (i.e., standard binary or multi-class) problems. Hence, for each $j = 1, \ldots, L$, a classifier $h_j$ is employed to map new data instances to the relevance of the $j$-th label, i.e.,

$$\hat{\mathbf{y}}_{\text{MAP}} \approx \hat{\mathbf{y}}^* = \mathbf{h}(\mathbf{x}^*) = [h_1(\mathbf{x}^*), \ldots, h_L(\mathbf{x}^*)],$$

where, probabilistically speaking, we can define each $h_j$ as

$$\hat{y}_j = h_j(\mathbf{x}^*) := \arg\max_{y_j} p(y_j|\mathbf{x}^*). \tag{3}$$

As we remarked in Section 1, this method is easy to build using off-the-shelf classifiers, but it does not explicitly model label dependencies, and its performance suffers as a result. In fact, it assumes complete independence, i.e., that

$$p(\mathbf{y}|\mathbf{x}^*) = \prod_{j=1}^{L} p(y_j|\mathbf{x}^*). \tag{4}$$

We always expect some label dependencies in a multi-label problem (otherwise we are simply dealing with a collection of unrelated problems); some labels are more likely to occur together, or to be mutually exclusive. It is important to model these dependencies, because doing so can influence the outcome of predictions.



2.2. Classifier Chains (CC) 

In [6], correlation among labels is considered. Classifier chains (CC) is based on modeling the correlation among labels using the chain rule of probability. Given a data instance, $\mathbf{x}^*$, and a vector of label indexes, $\mathbf{s} = [s_1, \ldots, s_L]^\top$, obtained as a permutation of $\{1, \ldots, L\}$, $p(\mathbf{y}|\mathbf{x}^*, \mathbf{s})$ may be expressed as^3

$$p(\mathbf{y}|\mathbf{x}^*, \mathbf{s}) = p(\tilde{y}_1|\mathbf{x}^*) \prod_{j=2}^{L} p(\tilde{y}_j|\mathbf{x}^*, \tilde{y}_1, \ldots, \tilde{y}_{j-1}), \tag{5}$$

where $\tilde{\mathbf{y}} = [\tilde{y}_1, \ldots, \tilde{y}_L]^\top$ is the permuted label vector (see Fig. 1), i.e., $\tilde{y}_j = y_{s_j}$ is the $j$-th label in the permutation, $j = 1, \ldots, L$, and the probabilities in (5) are learnt from the labelled data during the training stage. It is important to remark that, theoretically, Eq. (5) does not depend on the label order. However, since all the conditional densities in Eq. (5) are estimated from the training data, the label order can have a large effect in practice, as recognized by [7]. Given a data instance $\mathbf{x}^*$ and a label order $\mathbf{s}$, note that the vector $\tilde{\mathbf{y}}$ can be seen as a path in a tree with $L$ levels, and $p(\mathbf{y}|\mathbf{x}^*, \mathbf{s})$ is the "utility" corresponding to this path. Figure 2 depicts an example with $K_j = 2$ for all $j = 1, \ldots, L$.

First of all, CC arbitrarily chooses a label order $\mathbf{s}$. Then, during the test stage, given an instance $\mathbf{x}^*$, CC follows a single path of labels $\hat{\mathbf{y}}^*$ greedily down the chain of $L$ binary classifiers, with the $j$-th classifier, $h_j$, predicting the $j$-th label's relevance, $\hat{y}_j^*$, using all previous predictions $(\hat{y}_1^*, \ldots, \hat{y}_{j-1}^*)$, as

$$\hat{y}_j^* = h_j(\mathbf{x}^*|\mathbf{s}) = \arg\max_{\tilde{y}_j} p(\tilde{y}_j|\mathbf{x}^*, \hat{y}_1^*, \ldots, \hat{y}_{j-1}^*). \tag{6}$$

In carrying out classification down a chain in this way, CC models label dependencies and, as a result, usually performs much better than IC, while being similar in memory and time requirements in practice. However, due to its greedy approach (i.e., only one path is explored) and depending on the choice of $\mathbf{s}$, it is susceptible to errors in the initial links of the chain. Figure 1 depicts an example of a sequence of labels $\mathbf{s}$ in the assumed correlation structure.
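The greedy inference of Eq. (6) can be sketched as follows. This is a minimal illustration under stated assumptions: the trained chain is replaced by a hypothetical function `toy_conditional` that returns the conditional class probabilities for a fixed test instance given the labels predicted so far, and its numbers are invented for the example.

```python
def toy_conditional(prefix):
    """Hypothetical p(y_j | x*, y_1..y_{j-1}) for one fixed x*, binary labels.

    A trained classifier chain would supply these values; here they are
    hard-coded so that the example is self-contained.
    """
    if len(prefix) == 0:
        return [0.4, 0.6]
    if prefix[0] == 0:
        return [0.9, 0.1]
    return [0.55, 0.45]


def greedy_chain_inference(conditional, num_labels):
    """Follow a single path down the chain, taking the argmax at every link."""
    path, prob = [], 1.0
    for _ in range(num_labels):
        dist = conditional(tuple(path))
        best = max(range(len(dist)), key=dist.__getitem__)
        path.append(best)
        prob *= dist[best]
    return path, prob


path, prob = greedy_chain_inference(toy_conditional, 3)
print(path, prob)
```

In this toy tree the greedy path starts with the locally most probable first label (probability 0.6), although the path starting with the other value has a larger joint probability (0.4 x 0.9 x 0.9 = 0.324), which illustrates how an early decision propagates down the chain.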

2.3. Probabilistic Classifier Chains (PCC) 

Probabilistic classifier chains (PCC) was introduced in [7]. In the training phase, PCC is identical to CC; it considers a particular order of labels $\mathbf{s}$ (either chosen randomly, or as per default in the dataset). However, during the test stage PCC provides Bayes-optimal inference by exploring all the $\prod_{j=1}^{L} K_j = 2^L$ possible paths (they consider the simplest case $K_j = 2$ for all $j = 1, \ldots, L$). Hence, for a given test instance, $\mathbf{x}^*$, PCC provides the optimum label estimate, obtained by maximizing over the whole label vector, $\mathbf{y}$, rather than the individual labels, $y_j$, i.e.,

$$\hat{\mathbf{y}}^* = \mathbf{h}(\mathbf{x}^*|\mathbf{s}) = \arg\max_{\mathbf{y}} p(\mathbf{y}|\mathbf{x}^*, \mathbf{s}), \tag{7}$$

where $p(\mathbf{y}|\mathbf{x}^*, \mathbf{s})$ is given by (5). In [7] an overall improvement of PCC over CC is reported, but at the price of high computational complexity: it is intractable for more than about 10 labels ($2^{10}$ paths), a threshold that the majority of problems in the multi-label domain exceed. Moreover, since all the conditional densities in (5) are estimated from the training data, the results can depend on the chosen label order $\mathbf{s}$.

3 Note that, using the true conditional distributions, Eq. (5) does not depend on the label order. Namely, using the true densities, Eq. (5) is always exact for any choice of $\mathbf{s}$.
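For comparison with greedy inference, the exhaustive search of Eq. (7) can be sketched as below. As an assumption for illustration, the trained chain is again replaced by a hypothetical `toy_conditional` function with invented probabilities; the enumeration itself visits all $\prod_j K_j$ paths, which is what makes the approach intractable for large $L$.

```python
from itertools import product


def toy_conditional(prefix):
    """Hypothetical p(y_j | x*, y_1..y_{j-1}) for one fixed x*, binary labels."""
    if len(prefix) == 0:
        return [0.4, 0.6]
    if prefix[0] == 0:
        return [0.9, 0.1]
    return [0.55, 0.45]


def path_probability(conditional, path):
    """Chain-rule probability of a complete path, as in Eq. (5)."""
    prob = 1.0
    for j, label in enumerate(path):
        prob *= conditional(tuple(path[:j]))[label]
    return prob


def exhaustive_chain_inference(conditional, class_counts):
    """Evaluate every path (prod_j K_j of them) and keep the most probable."""
    best_path, best_prob = None, -1.0
    for path in product(*(range(k) for k in class_counts)):
        prob = path_probability(conditional, path)
        if prob > best_prob:
            best_path, best_prob = list(path), prob
    return best_path, best_prob


print(exhaustive_chain_inference(toy_conditional, [2, 2, 2]))
```

The loop body runs $2^L$ times for binary labels, so the cost doubles with every additional label.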

2.4. Bayesian Network Classifiers 

In [8] conditional dependency networks (CDN) are used as a way of avoiding choosing a specific label order $\mathbf{s}$. Whereas both CC and PCC are dependent on the order in which labels appear in the chain, CDN avoids this problem: it is a fully connected network, rather than a chain. This fully connected network is comprised of $L$ label-nodes $p(y_j|\mathbf{x}, \mathbf{y}_{-j})$ for nodes $j = 1, \ldots, L$ (where $\mathbf{y}_{-j} = [y_1, \ldots, y_{j-1}, y_{j+1}, \ldots, y_L]$ is the vector of all labels except the $j$-th). In this method, Gibbs sampling is used for inference over $I$ steps for collecting the marginal probabilities. Because the network has $L(L-1)/2$ links, inference may be problematic for large $L$.

[3] presents a more tractable network, finding an approximate representation of the dependency structure by using the Chow-Liu algorithm. They avoid needing costly graph inference (as used, for example, in the CDNs of [8]) by treating the graph as $L$ trees, where the $j$-th node is the root node of the $j$-th tree. However, this method is similar to CC in the sense that classification depends on the order of nodes. Unlike CC, it does not model all dependencies (the dependence between leaf variables is not necessarily modelled).

In the next section, we present a double Monte Carlo optimization technique to both find a good chain sequence and efficiently approximate Eq. (5) at inference time. The resulting method outperforms all the methods presented so far in this section while remaining tractable for practical problems.

3. EFFICIENT DOUBLE MONTE CARLO 
TECHNIQUE FOR CLASSIFIER CHAINS 

In chain-based MDC problems, for any given test instance, $\mathbf{x}^*$, and label order, $\mathbf{s}$, we wish to find the best label-relevance vector, $\hat{\mathbf{y}}^* = [\hat{y}_1^*, \ldots, \hat{y}_L^*]$, out of the $\prod_{j=1}^{L} K_j$ possible label vectors (different paths). However, the best inference on a poor model will not be as good as the best inference on a good model. Therefore, at training time we also wish to find the best chain order or sequence, $\hat{\mathbf{s}} = [s_1, \ldots, s_L]$, out of the $L!$ possible chains.



Fig. 1. Example of correlation structure in a classifier chain with $L = 4$. In the example, we have $\mathbf{s} = [s_1 = 4, s_2 = 1, s_3 = 3, s_4 = 2]$, so that $\tilde{y}_1 = y_4$, $\tilde{y}_2 = y_1$, $\tilde{y}_3 = y_3$ and $\tilde{y}_4 = y_2$.

Fig. 2. Considering $K_j = 2$ for all $j$, this figure depicts an example of the $\prod_{j=1}^{L} K_j = 2^L = 8$ paths ($L = 3$) of sequences of labels $\tilde{y}_j$, $j = 1, \ldots, L$. The best path, with probability 0.2160, is shown with dashed lines.



Unfortunately, the optimal solution of these two problems is not feasible for anything but very small values of $L$; the total space is $\left(\prod_{j=1}^{L} K_j\right) \times L!$. Hence, in this section we introduce an efficient double Monte Carlo strategy for quasi-optimal inference in classifier chains. We present both a tractable label prediction scheme at test time (MCC) and a method that performs an additional search for the optimal chain sequence at build time (M2CC).

3.1. Training step: finding the best chain 

In order to obtain the best chain (i.e., the optimal label order) during the training step, we introduce a payoff function^4

$$J(\mathbf{s}) = \sum_{i=1}^{N} p(\mathbf{y}^{(i)}|\mathbf{x}^{(i)}, \mathbf{s}), \tag{8}$$

and the optimal sequence, $\hat{\mathbf{s}}$, is the one that maximizes (8) over the set of $L!$ possible sequences, i.e.,

$$\hat{\mathbf{s}} = \arg\max_{\mathbf{s}} J(\mathbf{s}) = \arg\max_{\mathbf{s}} \sum_{i=1}^{N} p(\mathbf{y}^{(i)}|\mathbf{x}^{(i)}, \mathbf{s}). \tag{9}$$

The exact solution of (9) is intractable even for medium values of $L$. Therefore, we propose using the Monte Carlo approach summarized in Algorithm 1 to perform an efficient exploration of the label-sequence space. This algorithm starts with a randomly chosen label sequence, $\mathbf{s}_0$, which is then modified iteratively, seeking at least a local maximum of the payoff function.

4 This is an intuitive payoff function but, clearly, there are other possibilities. Note that $J(\mathbf{s})$ can be seen as a Monte Carlo approximation of the function $\bar{J}(\mathbf{s}) = \sum_{i} \int p(\mathbf{y}_i|\mathbf{x}, \mathbf{s})\, d\mathbf{x}$.

3.1.1. Basic proposal procedure 

In order to explore the space of $\mathbf{s}$, we require a proposal function. In this first algorithm we use the simplest possible proposal mechanism. Specifically, given a sequence

$$\mathbf{s}_{t-1} = [s_{t-1}(1) = s_1, \ldots, s_{t-1}(L) = s_L],$$

the proposal function $\pi(\mathbf{s}_t|\mathbf{s}_{t-1})$ consists of choosing uniformly two positions of the label sequence ($1 \le \ell, m \le L$) and swapping the labels corresponding to those positions, so that $s_t(\ell) = s_{t-1}(m)$ and $s_t(m) = s_{t-1}(\ell)$.
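The swap proposal, together with the accept/reject search of Algorithm 1 that is built on it, can be sketched as follows. This is a minimal illustration under stated assumptions: positions are zero-based, the two swapped positions are taken to be distinct, and the payoff $J(\mathbf{s})$ (which in the method requires training a chain for each candidate sequence) is replaced by a hypothetical, cheap `toy_payoff` callback.

```python
import random


def swap_proposal(sequence, rng):
    """Return a copy of `sequence` with two uniformly chosen positions swapped."""
    proposal = list(sequence)
    l, m = rng.sample(range(len(proposal)), 2)  # two distinct positions (assumption)
    proposal[l], proposal[m] = proposal[m], proposal[l]
    return proposal


def search_label_order(payoff, num_labels, iterations, rng):
    """Random search over label orders: keep a proposal only if the payoff improves."""
    current = list(range(num_labels))
    rng.shuffle(current)                # random initial sequence s_0
    current_payoff = payoff(current)
    for _ in range(iterations):
        candidate = swap_proposal(current, rng)
        candidate_payoff = payoff(candidate)
        if candidate_payoff > current_payoff:   # accept
            current, current_payoff = candidate, candidate_payoff
        # otherwise reject: the current sequence is kept
    return current, current_payoff


def toy_payoff(sequence):
    """Hypothetical stand-in for J(s): counts labels that sit at their own index."""
    return sum(1 for position, label in enumerate(sequence) if position == label)


best, value = search_label_order(toy_payoff, 4, 50, random.Random(0))
print(best, value)
```

In an actual implementation each call to the payoff would involve building the model $p(\mathbf{y}|\mathbf{x}, \mathbf{s}')$, which is why the number of iterations $T'$ is kept small.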

Algorithm 1 Finding a suitable s

Input:

• $\mathcal{D} = \{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})\}_{i=1}^{N}$: training data

• $\pi(\mathbf{s}|\mathbf{s}_{t-1})$: proposal density

• $T'$: number of iterations

Algorithm:

1. Start with some random sequence, $\mathbf{s}_0$, and build an initial model, $p(\mathbf{y}|\mathbf{x}, \mathbf{s}_0)$.

2. For $t = 1, \ldots, T'$:

(a) Draw $\mathbf{s}' \sim \pi(\mathbf{s}|\mathbf{s}_{t-1})$ and build model $p(\mathbf{y}|\mathbf{x}, \mathbf{s}')$.

(b) if $J(\mathbf{s}') > J(\mathbf{s}_{t-1})$

• $\mathbf{s}_t \leftarrow \mathbf{s}'$ accept.

(c) else

• $\mathbf{s}_t \leftarrow \mathbf{s}_{t-1}$ reject.

Output:

• $\hat{\mathbf{s}} = \mathbf{s}_{T'}$: estimated label sequence.

3.2. Inference (test) step: finding the best path y*

In the test step, for a given test instance, $\mathbf{x}^*$, for which the true label association is unknown, and a label order $\mathbf{s}$ (either estimated for M2CC or randomly chosen for MCC), we wish to find the optimal label vector, i.e., the one that maximizes $p(\mathbf{y}|\mathbf{x}^*, \mathbf{s})$ as in Eq. (7). In general, this problem can be solved exactly for low values of $L$ by exploring all the $\prod_{j=1}^{L} K_j$ possible paths, as in the PCC method [7]. However, when $L$ grows this method quickly becomes computationally intractable. Therefore, we propose here using the random search Monte Carlo approach shown in Algorithm 2 to approximate Eq. (7). This algorithm starts from the greedy inference offered by standard CC, draws samples $\mathbf{y}^{(t)}$, $t = 1, \ldots, T$, according to the model $p(\mathbf{y}|\mathbf{x}^*, \mathbf{s})$, and provides a predicted label sequence

$$\hat{\mathbf{y}}^* = \arg\max_{\mathbf{y}_t} p(\mathbf{y}_t|\mathbf{x}^*, \mathbf{s}), \tag{10}$$

where $\mathbf{y}_t$ ($1 \le t \le T$) are the accepted samples. Note that in Algorithm 2 the candidate vectors $\mathbf{y}'$ are all drawn directly from our target pdf $p(\mathbf{y}|\mathbf{x}^*, \mathbf{s})$. Indeed, it is easy to draw from, since $\mathbf{y}'$ is a possible path on a tree (see Figure 2). From a Monte Carlo point of view, this is an important consideration that guarantees that $\hat{\mathbf{y}}^* = \mathbf{y}_T$ will have (at least) a high probability $p(\mathbf{y}_T|\mathbf{x}^*, \mathbf{s})$.

Algorithm 2 Finding y* for a given test instance x*.

Input:

• $\mathbf{x}^*$: test instance.

• $\mathbf{s}$: label order (estimated or chosen randomly).

• $p(\mathbf{y}|\mathbf{x}, \mathbf{s})$: probabilistic model (from training stage).

Algorithm:

1. Obtain an initial path, $\mathbf{y}_0$, using CC.

2. For $t = 1, \ldots, T$:

(a) Draw $\mathbf{y}' \sim p(\mathbf{y}|\mathbf{x}^*, \mathbf{s})$

(b) if $p(\mathbf{y}'|\mathbf{x}^*, \mathbf{s}) > p(\mathbf{y}_{t-1}|\mathbf{x}^*, \mathbf{s})$

• $\mathbf{y}_t \leftarrow \mathbf{y}'$ accept.

(c) else

• $\mathbf{y}_t \leftarrow \mathbf{y}_{t-1}$ reject.

Output:

• $\hat{\mathbf{y}}^* = \mathbf{y}_T$: predicted label assignment.
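The random search of Algorithm 2 can be sketched as below. This is a minimal illustration, not the MEKA implementation: the trained chain is replaced by a hypothetical `toy_conditional` function with invented probabilities, and candidate paths are drawn by sampling one label at a time down the chain.

```python
import random


def toy_conditional(prefix):
    """Hypothetical p(y_j | x*, y_1..y_{j-1}) for one fixed x*, binary labels."""
    if len(prefix) == 0:
        return [0.4, 0.6]
    if prefix[0] == 0:
        return [0.9, 0.1]
    return [0.55, 0.45]


def greedy_path(conditional, num_labels):
    """Initial path y_0: the greedy CC prediction and its probability."""
    path, prob = [], 1.0
    for _ in range(num_labels):
        dist = conditional(tuple(path))
        best = max(range(len(dist)), key=dist.__getitem__)
        path.append(best)
        prob *= dist[best]
    return path, prob


def sample_path(conditional, num_labels, rng):
    """Draw y' ~ p(y | x*, s) by sampling each label given the ones before it."""
    path, prob = [], 1.0
    for _ in range(num_labels):
        dist = conditional(tuple(path))
        label = rng.choices(range(len(dist)), weights=dist)[0]
        path.append(label)
        prob *= dist[label]
    return path, prob


def monte_carlo_inference(conditional, num_labels, iterations, rng):
    """Keep the most probable path among the greedy one and T sampled ones."""
    best_path, best_prob = greedy_path(conditional, num_labels)
    for _ in range(iterations):
        path, prob = sample_path(conditional, num_labels, rng)
        if prob > best_prob:            # accept
            best_path, best_prob = path, prob
    return best_path, best_prob


print(monte_carlo_inference(toy_conditional, 3, 100, random.Random(0)))
```

Because the search starts from the greedy path and only accepts improvements, the returned path is never less probable than the CC prediction, and paths with high probability are the ones most likely to be drawn.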



3.3. First two proposed methods: MCC and M2CC

In order to compare fairly both the performance and the computational effort, we progressively apply the ideas introduced in this section. We define two methods:

• MCC: given a classifier chain trained on some previously determined sequence of labels $\mathbf{s}$ (e.g., chosen randomly as in CC and PCC), it infers label vectors for all test instances using Algorithm 2; it is essentially a tractable version of PCC.

• M2CC: in the training step, it additionally searches for a suitable sequence of labels $\mathbf{s}$ using Algorithm 1. Therefore, Algorithm 1 and Algorithm 2 jointly outline the entire M2CC method.

4. ENHANCEMENTS 
4.1. Population of label orders 

Since the payoff function $J(\mathbf{s})$ uses $p(\mathbf{y}|\mathbf{x}, \mathbf{s})$, which is just an approximation of the true data distribution, $f(\mathbf{y}|\mathbf{x})$, and in order to decrease the dependence on the training step (and thus to reduce overfitting), we can consider a population of label orders $\mathcal{S} = \{\mathbf{s}^{(1)}, \ldots, \mathbf{s}^{(M)}\}$. Namely, we can generate a random walk in the label order space $\{1, \ldots, L\}^L$ (using a suitable proposal density $\pi(\mathbf{s}|\mathbf{s}_{t-1})$) and, after $T'$ iterations, take the $M$ label orders with the largest payoff $J(\mathbf{s}^{(i)})$. The method is detailed in Algorithm 3.

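Algorithm 3 Finding a suitable population $\mathcal{S} = \{\mathbf{s}^{(i)}\}_{i=1}^{M}$

Input:

• $\mathcal{D} = \{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})\}_{i=1}^{N}$: training data

• $\pi(\mathbf{s}|\mathbf{s}_{t-1})$: proposal density

• $T'$: number of iterations

• $M$: number of output sequences

Algorithm:

1. Start with some random sequence $\mathbf{s}_0$.

2. For $t = 1, \ldots, T'$:

(a) $\mathbf{s}' \sim \pi(\cdot|\mathbf{s}_{t-1})$

(b) if $J(\mathbf{s}') > J(\mathbf{s}_{t-1})$

• $\mathbf{s}_t \leftarrow \mathbf{s}'$ accept

• $w_t \leftarrow J(\mathbf{s}')$ set

(c) Otherwise, if $J(\mathbf{s}') \le J(\mathbf{s}_{t-1})$

• $\mathbf{s}_t \leftarrow \mathbf{s}_{t-1}$ reject

• $w_t \leftarrow J(\mathbf{s}_{t-1})$ set

3. Sort $\mathbf{s}_1, \ldots, \mathbf{s}_{T'}$ in decreasing order according to $w_1, \ldots, w_{T'}$ and select the first $M$ sequences, $\mathbf{s}^{(1)}, \ldots, \mathbf{s}^{(M)}$.

Output:

• $\mathcal{S} = \{\mathbf{s}^{(1)}, \ldots, \mathbf{s}^{(M)}\}$.

• corresponding weights $w^{(1)}, \ldots, w^{(M)}$.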

Then, in the test step, we can use the information provided by the population $\mathcal{S} = \{\mathbf{s}^{(1)}, \ldots, \mathbf{s}^{(M)}\}$. The trivial way is to run $M$ parallel searches to find label vectors $\mathbf{y}^{(i)}$, one for each $\mathbf{s}^{(i)}$, $i = 1, \ldots, M$, and then take the best one. However, we propose a more sophisticated procedure that runs only one random search to find a suitable $\hat{\mathbf{y}}^*$ using the entire information in the population $\mathcal{S}$. The corresponding idea is shown in Algorithm 4.

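Algorithm 4 Finding y* given a test instance x* and a population S.

Input:

• $\mathbf{x}^*$: a test instance

• $M$ sequences $\mathbf{s}^{(1)}, \ldots, \mathbf{s}^{(M)}$ (obtained in the training step)

• $M$ weights $w^{(1)}, \ldots, w^{(M)}$ (obtained in the training step)

• $p(\mathbf{y}|\mathbf{x}, \mathbf{s})$ (obtained in the training step)

Algorithm:

1. Obtain an initial path, $\mathbf{y}_0$, using CC's greedy inference.

2. For $t_1 = 1, \ldots, T_1$:

(a) Choose $\mathbf{s}_{t_1} \in \{\mathbf{s}^{(1)}, \ldots, \mathbf{s}^{(M)}\}$ according (proportionally) to the weights $w^{(j)}$, $j = 1, \ldots, M$.

(b) Set $\mathbf{z}_1 = \mathbf{y}_{t_1 - 1}$.

(c) For $t_2 = 1, \ldots, T_2$:

i. $\mathbf{z}' \sim p(\mathbf{z}|\mathbf{x}^*, \mathbf{s}_{t_1})$

ii. if $p(\mathbf{z}'|\mathbf{x}^*, \mathbf{s}_{t_1}) > p(\mathbf{z}_{t_2}|\mathbf{x}^*, \mathbf{s}_{t_1})$

• $\mathbf{z}_{t_2+1} \leftarrow \mathbf{z}'$ accept

iii. else

• $\mathbf{z}_{t_2+1} \leftarrow \mathbf{z}_{t_2}$ reject

(d) Set $\mathbf{y}_{t_1} = \mathbf{z}_{T_2+1}$.

Output:

• $\hat{\mathbf{y}}^* = \mathbf{y}_{T_1}$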
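Two bookkeeping steps of the population approach, the selection of the $M$ best sequences at the end of Algorithm 3 and the weighted choice of a sequence in step 2(a) of Algorithm 4, can be sketched as follows. The function names and the example history are hypothetical; the weights stand for payoff values $J(\mathbf{s})$, which are assumed to be positive.

```python
import random


def select_population(history, population_size):
    """Sort visited (sequence, weight) pairs by weight and keep the first M.

    `history` holds one pair per iteration, so a sequence that survived
    several rejections may appear more than once.
    """
    ranked = sorted(history, key=lambda pair: pair[1], reverse=True)
    top = ranked[:population_size]
    sequences = [list(sequence) for sequence, _ in top]
    weights = [weight for _, weight in top]
    return sequences, weights


def draw_order(population, weights, rng):
    """Pick one label order with probability proportional to its weight."""
    return rng.choices(population, weights=weights, k=1)[0]


# Hypothetical history of visited sequences and their payoffs.
history = [([0, 1, 2], 1.5), ([1, 0, 2], 2.5), ([2, 1, 0], 0.5)]
population, weights = select_population(history, 2)
print(population, weights)
print(draw_order(population, weights, random.Random(0)))
```

Drawing proportionally to the weights means that orders with a higher training payoff guide the inner search more often, without discarding the other members of the population.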



4.2. Improved proposal density

Searching the sequence space requires building a classifier for each sequence we want to try, and is thus inherently much more expensive than searching the label space. This cost restricts how many sequences we can look at. Furthermore, we note that changing the initial 'links' in the chain (i.e., $s_1, s_2, \ldots$) implies a larger jump in the space than changing the final links (i.e., $\ldots, s_{L-1}, s_L$), due to the chain structure.

In light of these observations, we propose an improvement of the proposal $\pi(\mathbf{s}|\mathbf{s}_{t-1})$ that tries a new sequence $\mathbf{s}'$ based on the previous one, where the links in the chain are progressively frozen from beginning to end. Building $\mathbf{s}'$ implies a complexity of $L$ (classifiers to train). However, given a sequence $\mathbf{s}$ which we have already built, the complexity is $L - \ell$, where $\ell$ is the highest number for which the following holds:

$$s'(j) = s(j) \quad \text{for all } j = 1, \ldots, \ell,$$

where $1 \le \ell \le L$. Thus, by freezing the initial links, the probability of $\ell \to L$ increases, and thus the sequence is cheaper to build, and the sequence space cheaper to explore. We do this using the current iteration $t$ as the temperature.

Algorithm 5 details our new proposal density, which gradually freezes the sequence over time.

Algorithm 5 Improved proposal density $\pi(\mathbf{s}|\mathbf{s}_{t-1})$

Input:

• $\mathbf{s}_{t-1}$: previous label order, i.e., a sequence $\mathbf{s}_{t-1} = [s_1, \ldots, s_L]$ of $L$ indices, at iteration $t - 1$

• $t_1$: a time parameter

Proposal procedure:

1. Choose a value $s_{t-1}(i) = s_i$, where $i \in \{1, \ldots, L\}$, with probability

$p_{i,t} \propto$ [expression illegible in the source; it is piecewise in $t$, with one branch for $t < t_1$, and involves the constant $\beta$] (11)

where $\beta > 0$ is a constant.

2. Choose a value $s_{t-1}(k) = s_k$, where $k \ne i$, with probability

$p_{k,t} \propto$ [expression illegible in the source]

3. Set $\mathbf{s}_t = \mathbf{s}_{t-1}$ and then set $s_t(k) = s_{t-1}(i)$ and $s_t(i) = s_{t-1}(k)$.

Output:

• $\mathbf{s}_t$: proposed label order.


4.3. Enhanced Methods EM2CC and E_eM2CC

We have already described MCC and M2CC. We now define two further techniques:

• EM2CC: in this method, we use Algorithm 3 and Algorithm 4 jointly. However, we still use the trivial proposal density $\pi(\mathbf{s}|\mathbf{s}_{t-1})$ described in Section 3.1.1.

• E_eM2CC: in this case, we again use Algorithm 3 and Algorithm 4, but with the improved proposal $\pi(\mathbf{s}|\mathbf{s}_{t-1})$ described in Algorithm 5.

5. EXPERIMENTS 

We perform experiments on a collection of real-world datasets familiar to the multi-label literature [6, 5, 2]; see Table 2. We compare to the baseline IC [9], the original classifier chains method CC [6], the Bayes-optimal rendition PCC [7], and also the conditional dependency networks method CDN of [8] under $I = 1000$ total iterations. See Section 2 and the references therein for more information on these methods. For our methods MCC and M2CC, we use $T = 100$ (inference $\mathbf{y}$-step) and just $T' = 10$ for M2CC (training $\mathbf{s}$-step). Clearly, better results can be easily obtained by increasing $T'$.

As a base classifier we use Support Vector Machines (SVMs) fitted with logistic models (as in [14]). Logistic Regression has been a popular choice in the probabilistic multi-label literature [7, 8] due to its probabilistic output. However, we have found that SVM-based methods perform much better in comparison. By default an SVM will not provide probabilistic output (i.e., $h_j(\mathbf{x}) \in [0, 1]$) and for this reason we fit the logistic models.

All methods are implemented and will be made available within the MEKA framework^5, an open-source framework based on the WEKA machine learning framework [12] with added support for multi-label classification and evaluation.



5.1. Evaluation Measures

Multi-label evaluation measures can be categorised as either label-based measures that evaluate the performance of each label individually (and take an average across all labels), or example-based measures that evaluate the performance of label-classification vectors as if each were a single class value. Of the latter, we use the standard exact match measure where, given predictions $\hat{\mathbf{y}}^{(i)}$ and ground truths $\mathbf{y}^{(i)}$:

$$\text{EXACT MATCH} = \frac{1}{N} \sum_{i=1}^{N} I(\hat{\mathbf{y}}^{(i)} = \mathbf{y}^{(i)}).$$

To contrast with this measure, we use the well-known label-based Hamming score^6:

$$\text{HAMMING SCORE} = \frac{1}{NL} \sum_{i=1}^{N} \sum_{j=1}^{L} I(\hat{y}_j^{(i)} = y_j^{(i)}).$$

Note that $I(\cdot)$ is simply an indicator function in both cases, equal to 1 if its argument is true and 0 otherwise.

5 http://meka.sourceforge.net

6 Often posed as Hamming loss; i.e., 1 − Hamming score.

5.2. Results

We carry out 5-fold cross validation. Results for predictive performance are displayed in Table 3. Results for running time performance are given in Table 4.

We have also provided the ranks and average ranks of each method and significance results according to the Nemenyi test [13]; $a \succ b$ indicates that algorithm $a$ is significantly better than $b$ (under a p-value of 0.10).

The original CC paper [6] also presented CC in Bagging ensembles (ECC) to improve predictive performance. We additionally compare to this method.

5.3. Discussion

• As claimed in the literature, CC improves over IC considerably in EXACT MATCH and is similar under HAMMING SCORE.

• PCC in turn improves on CC (in the cases where it is tractable), also across all evaluation measures.

• Our MCC method outperforms CC on almost every occasion.

• MCC is identical to PCC on all datasets that it finishes on, but is much more tractable than PCC.

• M2CC obtains similar performance to MCC on these datasets overall.

• EM2CC obtains the best performance of all methods.

• E_eM2CC obtains performance almost as good as EM2CC, but is more efficient on most datasets. It is not more efficient on some of the large datasets, indicating that there may be some overhead in this particular implementation that is sensitive to $L$.

• Our methods are generally faster than CDN, especially for larger $L$.

• ECC improves over CC, but even these 10 CC models are unable to compete with our enhanced methods.

• There is clearly a qualitative difference between the multi-dimensional datasets and the sub-problems of binary multi-label datasets.

6. CONCLUSIONS AND FUTURE WORK



We introduced a novel double Monte Carlo technique for multi-dimensional learning using classifier chains. A Monte Carlo search technique is used both to efficiently search the label-path space at inference time and also the chain-sequence space at training time. We show with an extensive empirical evaluation that using our techniques results in better predictive performance than related methods while remaining computationally tractable. Our model obtains the best overall predictive performance of all the methods we looked at, and proves tractable enough for real-world applications.

In future work, we intend to look at more advanced random search algorithms and dependency structures other than chain models (see Eq. (5)), as well as different payoff functions to evaluate the sequence of labels.



Table 1. The methods we consider and their parameters. The novel methods we present are below the middle line, where each inherits the parameters of the previous, e.g., E_eM2CC takes parameters T = 100, T' = 50, M = 10, β = 0.03. For CDN, T_c is the number of collection iterations. For ECC, M is the number of models.

Key       Method                                                  Parameters             Reference
IC        Independent Classifiers                                                        [9]
CC        Classifier Chains                                                              [6]
ECC       Ensembles of Classifier Chains                          M = 10                 [6]
PCC       Probabilistic Classifier Chains                                                [7]
CDN       Conditional Dependency Networks                         T = 1000, T_c = 100    [8]
------------------------------------------------------------------------------------------------------
MCC       Monte Carlo Optimization (y-space) for CC               T = 100                Algorithm 2
M2CC      2x Monte Carlo Optimization (y-space, s-space) for CC   T' = 50                Algorithm 1
EM2CC     Enhanced M2CC (with s-populations) for CC               M = 10                 Algorithm 3
E_eM2CC   Enhanced M2CC (efficient populations) for CC            β = 0.03               Algorithm 5



Table 2. A collection of datasets and associated statistics, where LC is label cardinality: the average number of labels relevant to each example (defined only for binary labels). Multi-dimensional datasets are listed first, followed by multi-label (binary-only) datasets.

Dataset       N      L     K     d      LC     Type
Solar Flare   323    3     5     10     N/A    astronomy
Bridges       107    5     2-6   7      N/A    civil engineering
Thyroid       9172   7     2-5   28     N/A    medical
Parkinson's   488    5     3     58     N/A    medical
Music         593    6     2     72     1.87   audio
Scene         2407   6     2     294    1.07   image
Yeast         2417   14    2     103    4.24   biology
Genbase       661    27    2     1185   1.25   biology
Medical       978    45    2     1449   1.25   medical/text
Enron         1702   53    2     1001   3.38   email/text
Reuters       6000   103   2     500    1.46   news/text
Table 3. Predictive performance from 5-fold CV, displayed as: value (rank), i.e., the average value across all folds and the rank of that value for each dataset.

EXACT MATCH

Dataset     IC          CC          PCC         ECC         CDN         MCC         M2CC        EM2CC       E_eM2CC
SolFlare    0.774 (7)   0.796 (1)   0.780 (5)   0.690 (8)   0.591 (9)   0.780 (5)   0.786 (2)   0.786 (2)   0.786 (2)
Bridges     0.094 (9)   0.122 (4)   0.122 (4)   0.103 (8)   0.140 (1)   0.122 (4)   0.130 (2)   0.130 (2)   0.111 (7)
Parkins     0.174 (1)   0.172 (2)   0.164 (5)   0.162 (8)   0.164 (5)   0.164 (5)   0.162 (8)   0.170 (4)   0.172 (2)
Thyroid     0.835 (6)   0.016 (9)   0.842 (3)   0.820 (7)   0.783 (8)   0.842 (3)   0.842 (3)   0.845 (2)   0.846 (1)
Music       0.299 (7)   0.287 (9)   0.346 (4)   0.314 (6)   0.297 (8)   0.346 (4)   0.358 (3)   0.370 (1)   0.367 (2)
Scene       0.538 (8)   0.545 (7)   0.636 (3)   0.608 (6)   0.531 (9)   0.636 (3)   0.632 (5)   0.677 (2)   0.685 (1)
Yeast       0.140 (7)   0.151 (6)   DNF         0.186 (5)   0.069 (8)   0.209 (4)   0.220 (3)   0.225 (1)   0.223 (2)
Genbase     0.941 (8)   0.964 (2)   DNF         0.945 (6)   0.945 (6)   0.964 (2)   0.964 (2)   0.962 (5)   0.965 (1)
Medical     0.585 (8)   0.622 (4)   DNF         0.643 (1)   0.602 (7)   0.629 (2)   0.625 (3)   0.621 (5)   0.604 (6)
Enron       0.065 (8)   0.099 (3)   DNF         0.112 (1)   0.073 (7)   0.101 (2)   0.087 (6)   0.093 (5)   0.096 (4)
Reuters     0.287 (7)   0.346 (6)   DNF         0.364 (5)   0.271 (8)   0.366 (4)   0.371 (1)   0.371 (1)   0.367 (3)
avg. rank   6.91        4.82        4.00        5.55        6.91        3.45        3.45        2.73        2.82

Nemenyi significance: MCC ≻ IC; MCC ≻ CDN; M2CC ≻ IC; M2CC ≻ CDN; EM2CC ≻ IC; EM2CC ≻ CDN; E_eM2CC ≻ IC; E_eM2CC ≻ CDN.

HAMMING SCORE

Dataset     IC          CC          PCC         ECC         CDN         MCC         M2CC        EM2CC       E_eM2CC
SolFlare    0.901 (7)   0.916 (1)   0.903 (4)   0.846 (8)   0.772 (9)   0.903 (4)   0.902 (6)   0.904 (2)   0.904 (2)
Bridges     0.634 (7)   0.664 (3)   0.666 (1)   0.635 (6)   0.616 (9)   0.666 (1)   0.650 (4)   0.650 (4)   0.633 (8)
Parkins     0.681 (2)   0.684 (1)   0.673 (7)   0.679 (3)   0.679 (3)   0.673 (7)   0.673 (7)   0.679 (3)   0.678 (6)
Thyroid     0.974 (1)   0.829 (9)   0.974 (1)   0.970 (7)   0.961 (8)   0.974 (1)   0.973 (6)   0.974 (1)   0.974 (1)
Music       0.808 (4)   0.789 (8)   0.802 (6)   0.805 (5)   0.788 (9)   0.802 (6)   0.810 (3)   0.811 (1)   0.811 (1)
Scene       0.891 (7)   0.865 (8)   0.894 (4)   0.897 (3)   0.857 (9)   0.894 (4)   0.892 (6)   0.904 (2)   0.907 (1)
Yeast       0.790 (1)   0.752 (7)   DNF         0.788 (3)   0.718 (8)   0.783 (6)   0.787 (4)   0.789 (2)   0.787 (4)
Genbase     0.997 (7)   0.999 (1)   DNF         0.997 (7)   0.998 (5)   0.999 (1)   0.999 (1)   0.998 (5)   0.999 (1)
Medical     0.987 (5)   0.988 (2)   DNF         0.989 (1)   0.986 (8)   0.988 (2)   0.988 (2)   0.987 (5)   0.987 (5)
Enron       0.925 (2)   0.922 (5)   DNF         0.939 (1)   0.924 (3)   0.922 (5)   0.921 (8)   0.922 (5)   0.924 (3)
Reuters     0.985 (1)   0.985 (1)   DNF         0.985 (1)   0.983 (8)   0.985 (1)   0.985 (1)   0.985 (1)   0.985 (1)
avg. rank   4.00        4.18        3.83        4.09        7.18        3.45        4.36        2.82        3.00

Nemenyi significance: PCC ≻ CDN; MCC ≻ CDN; EM2CC ≻ CDN; E_eM2CC ≻ CDN.



Table 4. Running time performance (in seconds); average of 5-fold CV, displayed as: value (rank).

Dataset     IC        CC        PCC       ECC         CDN         MCC        M2CC        EM2CC      E_eM2CC
SolFlare    0 (9)     0 (8)     1 (6)     3 (5)       7 (4)       1 (7)      13 (1)      7 (3)      7 (2)
Bridges     1 (8)     0 (9)     2 (6)     5 (5)       8 (4)       1 (7)      22 (1)      13 (2)     12 (3)
Parkins     2 (9)     2 (8)     5 (6)     9 (5)       19 (4)      4 (7)      80 (1)      50 (2)     44 (3)
Thyroid     28 (9)    31 (8)    820 (4)   126 (6)     208 (5)     45 (7)     1417 (1)    1004 (2)   900 (3)
Music       0 (9)     0 (8)     1 (7)     2 (6)       6 (4)       5 (5)      45 (1)      18 (2)     10 (3)
Scene       12 (8)    11 (9)    15 (7)    44 (6)      92 (4)      90 (5)     1347 (1)    684 (2)    335 (3)
Yeast       11 (8)    11 (7)    DNF       66 (6)      88 (5)      149 (4)    1313 (1)    731 (2)    546 (3)
Genbase     11 (7)    8 (8)     DNF       56 (6)      573 (5)     1695 (2)   5287 (1)    774 (4)    823 (3)
Medical     9 (8)     11 (7)    DNF       86 (6)      1546 (3)    3420 (2)   6940 (1)    1038 (5)   1192 (4)
Enron       102 (7)   92 (8)    DNF       349 (6)     3091 (4)    3884 (2)   10821 (1)   2986 (5)   3470 (3)
Reuters     106 (8)   120 (7)   DNF       20593 (1)   14735 (3)   1837 (2)   5740 (4)    4890 (6)   5310 (5)
avg. rank   8.18      7.91      6.00      5.27        4.09        4.55       1.27        3.18       3.18